Here, we attempt to build a principled Bayesian model for some OOPS data. We illustrate the main ideas of a Bayesian workflow and refer to the accompanying manuscript for relevant details. The first code chunk loads the packages we need.
library(brms)
## Warning: package 'brms' was built under R version 4.0.5
## Loading required package: Rcpp
## Loading 'brms' package (version 2.15.0). Useful instructions
## can be found by typing help('brms'). A more detailed introduction
## to the package is available through vignette('brms_overview').
##
## Attaching package: 'brms'
## The following object is masked from 'package:stats':
##
## ar
library(pheatmap)
library(RColorBrewer)
library(MSnbase)
## Loading required package: BiocGenerics
## Warning: package 'BiocGenerics' was built under R version 4.0.5
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: mzR
## Warning in fun(libname, pkgname): mzR has been built against a different Rcpp version (1.0.5)
## than is installed on your system (1.0.6). This might lead to errors
## when loading mzR. If you encounter such issues, please send a report,
## including the output of sessionInfo() to the Bioc support forum at
## https://support.bioconductor.org/. For details see also
## https://github.com/sneumann/mzR/wiki/mzR-Rcpp-compiler-linker-issue.
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
##
## expand.grid
## Loading required package: ProtGenerics
##
## Attaching package: 'ProtGenerics'
## The following object is masked from 'package:stats':
##
## smooth
##
## This is MSnbase version 2.16.1
## Visit https://lgatto.github.io/MSnbase/ to get started.
##
## Attaching package: 'MSnbase'
## The following objects are masked from 'package:ProtGenerics':
##
## executeProcessingStep, ProcessingStep
## The following object is masked from 'package:base':
##
## trimws
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
library(ggfortify)
bayesplot::color_scheme_set("teal")
Now, we load the data we are interested in, which are from Queiroz et al. 2019. The data have already been summarised to protein level and column-median normalised. The design used 2 separate TMT 10-plexes to measure total protein abundance and OOPS-enriched protein abundance. This allows us to model changes in RNA binding that are independent of concurrent changes in total protein abundance. Changes in OOPS-enriched protein abundance that do not correlate with changes in total protein abundance are indicative of specific RBPs binding RNA differentially in those conditions. The experimental design uses thymidine-nocodazole arrest and then measures total protein abundance, as well as OOPS-enriched protein abundance, at \(0, 6\) and \(23\) hours. Triplicate measurements are made, except at 6 hours where 4 replicates were taken. For further details see Queiroz et al. 2019.
# We assume the data is in the working directory
oopsdata <- readMSnSet2(file = "oopsdata.csv", ecol = 2:21, fnames = 1)
It is first a good idea to get a feel for the data using some exploratory data analysis. Below we produce a heatmap. We can easily see from these data that there are some differences between the total and OOPS samples, as well as differences between the timepoints.
pheatmap(exprs(oopsdata), col = brewer.pal(n = 10, name = "PRGn"), cluster_cols = FALSE)
Now we perform PCA to see if there is any systematic clustering. There are clear differences between the OOPS and total samples, as well as a difference between the 6 h OOPS and 23 h OOPS samples.
autoplot(prcomp(t(exprs(oopsdata))), label = TRUE) + theme_classic()
Now, we look at a global boxplot across the columns. There look to be fairly consistent distributions of intensities across all the columns, and perhaps slightly higher variance for the OOPS samples.
boxplot(exprs(oopsdata), col = rep(brewer.pal(n = 2, name = "Set2"), each = 10))
## Warning in brewer.pal(n = 2, name = "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
barplot(apply(exprs(oopsdata),2,var), col = rep(brewer.pal(n = 2, name = "Set2"), each = 10))
## Warning in brewer.pal(n = 2, name = "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
The first step is to establish what questions and aspects of the experimental design we are interested in modelling. Ultimately, we are interested in looking at RBPs binding RNA differentially. However, a simple change in abundance will make it appear as though there is more RNA binding, so we need to control simultaneously for changes in abundance. There are also changes across the different time points. We also make replicate measurements and should allow some modelled variation between replicates. There are also TMT tags, and we might expect some variation due to which TMT tag was used. The variance is also higher for the OOPS samples, so it may be sensible to allow the variance to depend on the sample type.
We first build the simplest model that answers the question we are interested in. We account for time and sample type (OOPS vs total) and the interaction between them. The interaction allows us to examine RBPs that bind RNA differentially. The model looks like the following for each protein: \[ y = type + time + type*time \] We test using a single protein for simplicity.
set.seed(1)
data <- data.frame(exprs(oopsdata)[sample.int(nrow(oopsdata), size = 1),])
colnames(data) <- "abundance"
data$type <- as.factor(rep(c("total", "oops"), each = 10))
data$time <- as.factor(rep(rep(c("0", "6", "23"), times = c(3,4,3)), 2))
data$replicate <- as.factor(sapply(strsplit(rownames(data), "_"), function(x)x[2]))
data$tag <- as.factor(rep(TMT10@reporterNames, times = 2))
Let us start the workflow by setting some priors and the model. In the code chunk below, we pick some roughly arbitrary priors to get started. When developing your own model, it can be useful to use the get_prior function to see which priors can be set.
Below, we have asked the code just to return samples from the prior so we can see what effect they have in practice. We can then visualise some summaries of the data.
pr <- c(prior(normal(0,1), class = "b"),
prior(exponential(1), class = "sigma"))
fit_prot1 <- brm(abundance ~ 0 + type + time + (type:time),
data = data,
family = gaussian(link = "identity"),
sample_prior = "only",
prior = pr)
## Compiling Stan program...
## Start sampling
##
## SAMPLING FOR MODEL '95b1aa6f5505af21af888d28577bf886' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.031 seconds (Warm-up)
## Chain 1: 0.027 seconds (Sampling)
## Chain 1: 0.058 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL '95b1aa6f5505af21af888d28577bf886' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 0 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 0.029 seconds (Warm-up)
## Chain 2: 0.028 seconds (Sampling)
## Chain 2: 0.057 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL '95b1aa6f5505af21af888d28577bf886' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 0 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 0.035 seconds (Warm-up)
## Chain 3: 0.027 seconds (Sampling)
## Chain 3: 0.062 seconds (Total)
## Chain 3:
##
## SAMPLING FOR MODEL '95b1aa6f5505af21af888d28577bf886' NOW (CHAIN 4).
## Chain 4:
## Chain 4: Gradient evaluation took 0 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4:
## Chain 4:
## Chain 4: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 4: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 4: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 4: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 4: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 4: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 4: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 4: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 4: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 4: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 4: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 4:
## Chain 4: Elapsed Time: 0.031 seconds (Warm-up)
## Chain 4: 0.045 seconds (Sampling)
## Chain 4: 0.076 seconds (Total)
## Chain 4:
plot(fit_prot1, ask = FALSE)
summary(fit_prot1)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: abundance ~ 0 + type + time + (type:time)
## Data: data (Number of observations: 20)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## typeoops 0.00 1.00 -1.94 1.99 1.00 6472 2922
## typetotal 0.01 0.97 -1.87 1.89 1.00 6180 3166
## time23 -0.02 1.02 -2.03 2.03 1.00 6985 2957
## time6 0.02 0.97 -1.87 1.95 1.00 7031 3117
## typetotal:time23 0.01 1.04 -2.00 2.08 1.00 6071 2729
## typetotal:time6 0.01 1.00 -1.96 1.92 1.00 5986 2945
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 1.00 1.01 0.02 3.65 1.00 3703 1801
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
We make a few initial observations. For the moment, it appears that the parameters are on a sensible scale. There are a few aspects we should be mindful of as we progress through the modelling. The first is the values for \(\sigma\), which can occasionally get unrealistically large. For these data, standard deviations above \(2.5\) should be extremely rare, whilst we see from the prior simulations that this happens not infrequently. We also note, from examining the PCA plot, that the effect size for type (OOPS or total) is likely to be larger than that for time. Furthermore, it seems unlikely that the interaction effects would be on the same scale as the type changes: one would expect that, after accounting for type and time, the interaction would have a smaller effect.
The next step in the modelling is a prior predictive check. Here, we produce some summary of the data and summaries of the data simulated from the prior predictive distribution. A huge number of different checks are available.
# compute kernel density estimate of data and simulated data and plot
pp_check(object = fit_prot1, "dens_overlay")
## Using 10 posterior samples for ppc type 'dens_overlay' by default.
# plot data in a histogram to see distributions
pp_check(object = fit_prot1, "hist")
## Using 10 posterior samples for ppc type 'hist' by default.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# plot data in a boxplot to see distributions
pp_check(object = fit_prot1, "boxplot")
## Using 10 posterior samples for ppc type 'boxplot' by default.
## notch went outside hinges. Try setting notch=FALSE.
## notch went outside hinges. Try setting notch=FALSE.
## notch went outside hinges. Try setting notch=FALSE.
## notch went outside hinges. Try setting notch=FALSE.
# produce summary statistics
pp_check(object = fit_prot1, "stat_2d")
## Using all posterior samples for ppc type 'stat_2d' by default.
# produce intervals
pp_check(object = fit_prot1, "intervals")
## Using all posterior samples for ppc type 'intervals' by default.
# predictive errors
pp_check(object = fit_prot1, "error_hist")
## Using 10 posterior samples for ppc type 'error_hist' by default.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
These prior predictive checks suggest that the prior is centred correctly, but is very diffuse. In principle, a diffuse prior might not be a problem if the data are sufficiently informative. However, for proteomics data we are unlikely to make more than 3 or 4 measurements per condition, and it may be sensible to build a better generative model before we move further into the modelling task.
Let us be a little more mathematical than our previous approach. The standard deviations are unlikely to exceed \(2.5\) in this proteomics experiment. However, an exponential(1) prior leads to standard deviations above 2.5 around 8% of the time (see code chunk below). For us, rare would be 1 in 5000 proteins, so this prior is clearly miscalibrated with respect to our expectations. We can see that an exponential with rate \(4\) captures this expectation.
# probability of observing sd above 2.5
sum(rexp(1000, rate = 1) > 2.5)/1000
## [1] 0.07
# with rate 4, the probability of observing an sd above 2.5 is well below 1/5000
sum(rexp(10000, rate = 4) > 2.5)/10000
## [1] 1e-04
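As a cross-check on the Monte Carlo estimates above, the exponential tail probability has a closed form, \(P(\sigma > c) = e^{-\lambda c}\). A short sketch of this calculation (written in Python purely for illustration, since it is language-agnostic):

```python
import math

def exp_tail(rate, cutoff):
    # P(X > cutoff) for X ~ Exponential(rate) is exp(-rate * cutoff)
    return math.exp(-rate * cutoff)

# rate 1: standard deviations above 2.5 occur about 8% of the time
print(round(exp_tail(1.0, 2.5), 4))   # 0.0821
# rate 4: about 4.5e-05, comfortably rarer than 1 in 5000
print(round(exp_tail(4.0, 2.5), 6))
```

The exact tail probabilities agree with the simulation-based estimates above, up to Monte Carlo error.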
However, as we push the prior through the model, it is not clear how changing the prior on the standard deviation will be reflected in the generated data. We can repeat our prior predictive checks. Below we fit the model without data, as before. Note that the values for \(\sigma\) are now more aligned with our expectations.
pr <- c(prior(normal(0,1), class = "b"),
prior(exponential(4), class = "sigma"))
fit_prot1 <- brm(abundance ~ 0 + type + time + (type:time),
data = data,
family = gaussian(link = "identity"),
sample_prior = "only",
prior = pr)
## Compiling Stan program...
## Start sampling
##
## SAMPLING FOR MODEL 'f3ea8e7e93c0ee0812911bccee541a3c' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.028 seconds (Warm-up)
## Chain 1: 0.025 seconds (Sampling)
## Chain 1: 0.053 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL 'f3ea8e7e93c0ee0812911bccee541a3c' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 0 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 0.03 seconds (Warm-up)
## Chain 2: 0.027 seconds (Sampling)
## Chain 2: 0.057 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL 'f3ea8e7e93c0ee0812911bccee541a3c' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 0 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 0.029 seconds (Warm-up)
## Chain 3: 0.025 seconds (Sampling)
## Chain 3: 0.054 seconds (Total)
## Chain 3:
##
## SAMPLING FOR MODEL 'f3ea8e7e93c0ee0812911bccee541a3c' NOW (CHAIN 4).
## Chain 4:
## Chain 4: Gradient evaluation took 0 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4:
## Chain 4:
## Chain 4: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 4: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 4: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 4: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 4: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 4: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 4: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 4: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 4: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 4: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 4: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 4:
## Chain 4: Elapsed Time: 0.03 seconds (Warm-up)
## Chain 4: 0.027 seconds (Sampling)
## Chain 4: 0.057 seconds (Total)
## Chain 4:
plot(fit_prot1, ask = FALSE)
summary(fit_prot1)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: abundance ~ 0 + type + time + (type:time)
## Data: data (Number of observations: 20)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## typeoops -0.00 0.98 -1.90 1.97 1.00 6089 2964
## typetotal -0.00 0.98 -1.93 1.90 1.00 7521 3236
## time23 -0.01 1.01 -1.98 2.02 1.00 6306 2979
## time6 -0.01 0.99 -1.94 1.92 1.00 6427 2254
## typetotal:time23 0.00 0.99 -1.95 1.87 1.00 7048 2922
## typetotal:time6 0.02 1.03 -2.04 2.08 1.00 6634 2837
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 0.25 0.25 0.01 0.90 1.00 5027 2192
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
From the prior predictive checks below, we see that the simulated data are much less diffuse around the observed data than before, whilst still being fairly broad in the range of data we could model.
# compute kernel density estimate of data and simulated data and plot
pp_check(object = fit_prot1, "dens_overlay")
## Using 10 posterior samples for ppc type 'dens_overlay' by default.
# plot data in a histogram to see distributions
pp_check(object = fit_prot1, "hist")
## Using 10 posterior samples for ppc type 'hist' by default.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# plot data in a boxplot to see distributions
pp_check(object = fit_prot1, "boxplot")
## Using 10 posterior samples for ppc type 'boxplot' by default.
## notch went outside hinges. Try setting notch=FALSE.
## notch went outside hinges. Try setting notch=FALSE.
## notch went outside hinges. Try setting notch=FALSE.
## notch went outside hinges. Try setting notch=FALSE.
## notch went outside hinges. Try setting notch=FALSE.
## notch went outside hinges. Try setting notch=FALSE.
# produce summary statistics
pp_check(object = fit_prot1, "stat_2d")
## Using all posterior samples for ppc type 'stat_2d' by default.
# produce intervals
pp_check(object = fit_prot1, "intervals")
## Using all posterior samples for ppc type 'intervals' by default.
# predictive errors
pp_check(object = fit_prot1, "error_hist")
## Using 10 posterior samples for ppc type 'error_hist' by default.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Fitting the model to the data
Let us now examine the model fitted to the data.
pr <- c(prior(normal(0,1), class = "b"),
prior(exponential(4), class = "sigma"))
fit_prot1_post<- brm(abundance ~ 0 + type + time + (type:time),
data = data,
family = gaussian(link = "identity"),
sample_prior = "no",
prior = pr)
## Compiling Stan program...
## recompiling to avoid crashing R session
## Start sampling
##
## SAMPLING FOR MODEL 'f3ea8e7e93c0ee0812911bccee541a3c' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.079 seconds (Warm-up)
## Chain 1: 0.066 seconds (Sampling)
## Chain 1: 0.145 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL 'f3ea8e7e93c0ee0812911bccee541a3c' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 0 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 0.074 seconds (Warm-up)
## Chain 2: 0.055 seconds (Sampling)
## Chain 2: 0.129 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL 'f3ea8e7e93c0ee0812911bccee541a3c' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 0 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 0.067 seconds (Warm-up)
## Chain 3: 0.08 seconds (Sampling)
## Chain 3: 0.147 seconds (Total)
## Chain 3:
##
## SAMPLING FOR MODEL 'f3ea8e7e93c0ee0812911bccee541a3c' NOW (CHAIN 4).
## Chain 4:
## Chain 4: Gradient evaluation took 0 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4:
## Chain 4:
## Chain 4: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 4: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 4: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 4: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 4: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 4: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 4: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 4: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 4: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 4: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 4: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 4:
## Chain 4: Elapsed Time: 0.072 seconds (Warm-up)
## Chain 4: 0.063 seconds (Sampling)
## Chain 4: 0.135 seconds (Total)
## Chain 4:
plot(fit_prot1_post, ask = FALSE)
summary(fit_prot1_post)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: abundance ~ 0 + type + time + (type:time)
## Data: data (Number of observations: 20)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## typeoops 0.08 0.06 -0.03 0.20 1.00 1645 1909
## typetotal 0.30 0.06 0.19 0.42 1.00 2009 1756
## time23 -0.28 0.08 -0.44 -0.12 1.00 1672 1775
## time6 0.07 0.07 -0.07 0.23 1.00 1650 2076
## typetotal:time23 0.13 0.11 -0.09 0.36 1.00 1625 2041
## typetotal:time6 -0.24 0.11 -0.45 -0.03 1.00 1524 1880
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 0.10 0.02 0.07 0.14 1.00 1881 2420
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Before we dive into what the model is telling us, we check that our Bayesian computations have been faithful. Here, Rhat is about \(1\), so it is unlikely there are any computational obstructions.
The above demonstrates that the posterior distributions have clearly learnt something from the data. Let us have a look at the distribution of one of the parameters. The prior, in orange, is clearly spread out compared to the posterior distribution, in green. It looks like the data were very informative about this parameter.
# prior samples in orange, posterior samples in green
hist(posterior_samples(fit_prot1)[,1],
     main = "Prior and posterior distributions, type=oops",
     breaks = 20, col = brewer.pal(n = 3, name = "Set2")[2], xlab = "estimate")
hist(posterior_samples(fit_prot1_post)[,1],
     breaks = 30, col = brewer.pal(n = 3, name = "Set2")[1], xlab = "estimate", add = TRUE)
We can compute the posterior contraction to quantify how much the posterior has concentrated relative to the prior. The value is close to 1, suggesting the parameter is well informed by the data.
contraction <- 1 - (sd(posterior_samples(fit_prot1_post)[,1])/sd(posterior_samples(fit_prot1)[,1]))
contraction
## [1] 0.9412167
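The contraction statistic itself is simple: one minus the ratio of posterior to prior standard deviation. A hypothetical sketch (in Python for illustration; the draws below are simulated stand-ins whose scales loosely match the typeoops summaries above, not the actual brms samples):

```python
import random
import statistics

def posterior_contraction(prior_draws, posterior_draws):
    # 1 - sd(posterior) / sd(prior): values near 1 mean the data dominate the prior
    return 1 - statistics.stdev(posterior_draws) / statistics.stdev(prior_draws)

rng = random.Random(1)
# stand-in draws: a diffuse normal(0, 1) prior and a tight normal(0.08, 0.06) posterior
prior = [rng.gauss(0, 1) for _ in range(4000)]
posterior = [rng.gauss(0.08, 0.06) for _ in range(4000)]
print(posterior_contraction(prior, posterior))  # close to 1 when the posterior is tight
```

A value near 0 would instead indicate that the data barely updated the prior for that parameter.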
Now let us have a look at the posterior predictive checks. From each of the plots below we can see that the posterior predictive distribution matches the summaries of the data extremely well. This suggests that data generated by using the posterior estimates of the parameters look like the data we observed.
# compute kernel density estimate of data and simulated data and plot
pp_check(object = fit_prot1_post, "dens_overlay")
## Using 10 posterior samples for ppc type 'dens_overlay' by default.
# plot data in a histogram to see distributions
pp_check(object = fit_prot1_post, "hist")
## Using 10 posterior samples for ppc type 'hist' by default.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# plot data in a boxplot to see distributions
pp_check(object = fit_prot1_post, "boxplot")
## Using 10 posterior samples for ppc type 'boxplot' by default.
## notch went outside hinges. Try setting notch=FALSE.
## notch went outside hinges. Try setting notch=FALSE.
## notch went outside hinges. Try setting notch=FALSE.
# produce summary statistics
pp_check(object = fit_prot1_post, "stat_2d")
## Using all posterior samples for ppc type 'stat_2d' by default.
# produce intervals
pp_check(object = fit_prot1_post, "intervals")
## Using all posterior samples for ppc type 'intervals' by default.
# predictive errors
pp_check(object = fit_prot1_post, "error_hist")
## Using 10 posterior samples for ppc type 'error_hist' by default.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We notice that there is a small bump at the lower end of the first density plot. We can generate more samples to see if this bump is something the model can reproduce. The plot below shows that we can generate data similar to those observed. However, there is a concern that the simulated data are perhaps more extreme than those observed, and occasionally look unrealistic.
pp_check(object = fit_prot1_post, "dens_overlay", nsamples = 50)
# Using uncertainty
Before we investigate more complex models, we examine what uncertainty quantification can do to help us understand our model, our inferences and our problem. Recall that we are particularly interested in the interaction between type and time. We first look at the interaction at 6 hours with type. As we can see from the plot, the estimate for the interaction is mostly negative. This suggests that, relative to the total sample, the OOPS sample increased, which may lead us to believe that the proportion of the protein that is RNA-bound is greater.
hist(posterior_samples(fit_prot1_post)[,6],
breaks = 30, col = brewer.pal(n = 3, name = "Set2")[1],
xlab = "estimate",
main = "Posterior estimates for interaction between total and 6 h")
Since we are working with probability distributions, we can simply calculate
the probability that this estimate is less than 0. We see this is \(0.98575\),
and we can compute other probabilities as well. For example, we may only
be interested in meaningful effect sizes.
# Probability less than 0
sum(posterior_samples(fit_prot1_post)[,6] < 0)/length(posterior_samples(fit_prot1_post)[,6])
## [1] 0.98575
# Probability less than -0.1
sum(posterior_samples(fit_prot1_post)[,6] < -0.1)/length(posterior_samples(fit_prot1_post)[,6])
## [1] 0.90475
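Since the mean of a logical vector is the proportion of `TRUE` values, these calculations can be written more idiomatically with `mean()`. A self-contained sketch, using simulated draws that stand in for a column of `posterior_samples(fit_prot1_post)` (the mean and standard deviation are taken from the interaction estimate in the model summary, purely for illustration):

```r
# mean() of a logical vector estimates a probability, so
# mean(draws < 0) is equivalent to sum(draws < 0) / length(draws).
set.seed(123)
draws <- rnorm(4000, mean = -0.24, sd = 0.11) # hypothetical 6 h interaction draws
mean(draws < 0)    # P(interaction at 6 h < 0)
mean(draws < -0.1) # P(interaction at 6 h < -0.1)
```

The `mean()` form avoids repeating the denominator and is less error-prone when the number of draws changes.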
We can also examine the interaction effect at 23h, where we see the effect is in the opposite direction.
hist(posterior_samples(fit_prot1_post)[,5],
breaks = 30, col = brewer.pal(n = 3, name = "Set2")[2],
xlab = "estimate",
main = "Posterior estimates for interaction between total and 23 h")
We can calculate probabilistic quantities from this distribution, and we can
see that the effect is much weaker in this case. However, there is still some
evidence for a decrease in the proportion of this protein binding RNA relative
to total protein at 23 h post nocodazole arrest.
# Probability greater than 0
sum(posterior_samples(fit_prot1_post)[,5] > 0)/length(posterior_samples(fit_prot1_post)[,5])
## [1] 0.889
# Probability greater than 0.1
sum(posterior_samples(fit_prot1_post)[,5] > 0.1)/length(posterior_samples(fit_prot1_post)[,5])
## [1] 0.61875
We can also ask more elaborate probabilistic questions of these data. For example, we can ask the probability that the changes for the interaction terms were in the same direction at 6 hours and 23 hours. We see that the probability that the effects are in the same direction is quite small; more likely, the effects are in different directions.
probBothPositive <- sum(posterior_samples(fit_prot1_post)[,5] > 0 &
posterior_samples(fit_prot1_post)[,6] > 0)/4000
probBothNegative <- sum(posterior_samples(fit_prot1_post)[,5] < 0 &
posterior_samples(fit_prot1_post)[,6] < 0)/4000
probBothNegative + probBothPositive
## [1] 0.12525
We can also compute the probability that the effects are greater than \(0.1\) at both times in any direction. We can see we are quite uncertain as to whether the effects are both simultaneously greater than \(0.1\). Of course we can ask more complex probability questions in the same manner.
sum(abs(posterior_samples(fit_prot1_post)[,5]) > 0.1 &
abs(posterior_samples(fit_prot1_post)[,6]) > 0.1)/4000
## [1] 0.555
Credible intervals can be obtained from the model summary.
summary(fit_prot1_post)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: abundance ~ 0 + type + time + (type:time)
## Data: data (Number of observations: 20)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## typeoops 0.08 0.06 -0.03 0.20 1.00 1645 1909
## typetotal 0.30 0.06 0.19 0.42 1.00 2009 1756
## time23 -0.28 0.08 -0.44 -0.12 1.00 1672 1775
## time6 0.07 0.07 -0.07 0.23 1.00 1650 2076
## typetotal:time23 0.13 0.11 -0.09 0.36 1.00 1625 2041
## typetotal:time6 -0.24 0.11 -0.45 -0.03 1.00 1524 1880
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 0.10 0.02 0.07 0.14 1.00 1881 2420
##
## Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
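Intervals at other probability levels can also be extracted programmatically with the standard brms accessors; a sketch, assuming the fitted object `fit_prot1_post` from above:

```r
# 90% credible intervals for all population-level effects
fixef(fit_prot1_post, probs = c(0.05, 0.95))

# or directly for a single parameter from its posterior draws
posterior_interval(fit_prot1_post, pars = "b_typetotal:time6", prob = 0.9)
```

These return matrices of estimates and interval endpoints, which is convenient when downstream code needs the intervals rather than a printed summary.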
This provides the basics of the workflow. We now have a probabilistic model that we can query and whose uncertainties we can quantify.